VAL_OTU_ID is a unified pipeline for validating eDNA metabarcoding
taxonomy assignments. It addresses a fundamental limitation of standard
taxonomy assignment tools like DADA2’s assignTaxonomy():
they ignore geographic context and reference database
completeness.
When DADA2 assigns taxonomy, it finds the best-matching sequence in the reference database. However, this is a sequence analysis exercise that ignores species ranges and potential downfalls of the completeness of the reference database, and interactions between these two facets:
Species can also occur outside their established ranges if they are Invasive or Alien species (IAS). In the case of IAS, such validations may be confounded, so the algorithm to address this problem, described below, also checks validated identifications and original identifications against IAS databases and the IUCN red list database to flag any ASV/OTU identification that may warrant closer scrutiny.
VAL_OTU_ID validates each ASV/OTU assignment by:
.rds
file or CSV with species assignments.rds)Easiest use to get started is: 1. Get taxonomic assignments of ASVs/OTUs using a validated reference database (e.g. MIDORI or SILVA) and the assignTaxonomy() function in DADA2 2. Construct a phyloseq object with a minimum of the otu_table, tax_table, refseq slots filled
Alternatively, a csv file of the taxonomic assignment of each ASV/OTU and a fasta file of the the sequences of each ASV/OTU can be used.
| File | Description |
|---|---|
*_geographic_validated.csv |
GBIF/OBIS occurrence results for each species |
*_database_validated.csv |
MIDORI2 sequence comparison results |
*_combined_decisions.csv |
Full results with decision logic |
*_phyloseq.csv |
Simplified output ready for R import |
For each unique species in your taxonomy, the pipeline queries GBIF and OBIS using a tiered bounding box approach:
The bounding box is defined for the study area and can be specified by the user
| Category | Definition | Implication |
|---|---|---|
| CONFIRMED | Records exist within the study area | Species occurs here - ID plausible |
| PLAUSIBLE | Records within buffer zone (default 500km) | Near known range - ID likely |
| POSSIBLE | Records within threshold (default 2000km) | Edge of range - needs scrutiny |
| DOUBTFUL | Nearest record beyond threshold | Likely misidentification |
| NO_RECORDS | No occurrence records found globally | Unknown species or data gap |
For species that are not CONFIRMED, the pipeline also searches for congeners (same genus) in the study region. This information is used for:
For each ASV, the pipeline:
The pipeline uses three methods in order of preference:
⚠️ Important: Install vsearch for accurate results. K-mer similarity is NOT equivalent to true sequence identity.
| Flag | Definition |
|---|---|
| CONFIDENT | Assigned species matches ≥99% |
| LIKELY | Assigned species matches well (97-99%) |
| UNCERTAIN | Some regional congeners missing from MIDORI2 |
| REASSIGN | A congener matches >0.5% better than assigned |
| NO_REF | Assigned species not in MIDORI2 |
| NO_CONGENERS | No other congeners in region (monotypic genus) |
This is where the magic happens. The pipeline combines geographic and database evidence to make a final decision.
The pipeline will ONLY reassign to a congener that is geographically CONFIRMED or PLAUSIBLE in your study area.
This prevents incorrect reassignments where a non-local species happens to match better in the database. For example:
| Scenario | Assigned Species | Best DB Match | Result |
|---|---|---|---|
| Both local | P. virens (98.8%, CONFIRMED) | P. pollachius (100%, CONFIRMED) | REASSIGN to P. pollachius |
| Best match not local | C. harengus (98.8%, CONFIRMED) | C. pallasii (99.4%, NOT LOCAL) | KEEP C. harengus |
| Neither local | P. stellatus (97%, DOUBTFUL) | P. flesus (97.5%, CONFIRMED) | REASSIGN to P. flesus |
Legend: MIDORI2 = sequence reference database checks. GBIF/OBIS = geographic occurrence checks. Green = confident ID. Yellow = dropped to higher rank.
The original DADA2 assignment is retained. This happens when:
The ASV is reassigned to a different species. This ONLY happens when:
reassign_diff_pct (default 0.5%) OR
it’s the only local congenerThe species-level ID is removed, keeping only genus. Output shows
Genus or Genus sp. cf. species. This happens
when:
Rare cases where sequence identity is too low for genus-level confidence.
cf. (Latin: confer, meaning “compare with”)
indicates uncertainty at the species level while suggesting the most
likely identity.
Format: Genus sp. cf. epithet
Example: Platichthys sp. cf. flesus
This means: “A Platichthys species, probably flesus but not confirmed”
Conditions for cf. notation:
| Parameter | Description |
|---|---|
--input, -i |
Phyloseq .rds file OR base name of
_taxonomy.csv/_sequences.fasta |
--midori, -m |
MIDORI2 reference FASTA (filename or full path) |
--bbox, -b |
Study area bounding box:
minLon minLat maxLon maxLat |
| Parameter | Default | Description |
|---|---|---|
--taxonomy, -t |
- | Full path to taxonomy CSV (alternative to –input) |
--sequences, -s |
- | Full path to sequences FASTA (alternative to –input) |
--species-column, -c |
species_raw |
Column name containing species names |
--output-dir, -o |
results_files |
Output directory |
--starting-dir |
starting_files |
Directory for input files when using –input |
--reference-dir |
Reference_databases |
Directory for MIDORI2 when filename only given |
| Parameter | Default | Description |
|---|---|---|
--buffer-km |
500 | Buffer around study area for “PLAUSIBLE” classification |
--distance-threshold-km |
2000 | Maximum distance for “POSSIBLE” classification |
--congener-search-km |
2000 | Search area expansion for finding congeners |
| Parameter | Default | Description |
|---|---|---|
--min-species-pct |
97.0 | Minimum % identity for species-level ID |
--min-genus-pct |
90.0 | Minimum % identity for genus-level ID |
--min-family-pct |
80.0 | Minimum % identity for family-level ID |
--cf-threshold-pct |
97.0 | Minimum % identity for sp. cf. notation |
--reassign-diff-pct |
0.5 | Minimum % difference to reassign when multiple local congeners |
| Parameter | Default | Description |
|---|---|---|
--max-concurrent |
10 | Maximum concurrent API requests |
--worms-rate-limit |
0.1 | Seconds between WoRMS requests |
| Parameter | Default | Description |
|---|---|---|
--kmer-size |
15 | K-mer size for sequence comparison fallback |
--vsearch-path |
vsearch |
Path to vsearch executable if installed |
These are optional checks that can be enabled with flags:
| Parameter | Description |
|---|---|
--check-iucn |
Enable IUCN Red List status checking |
--check-invasive |
Enable invasive species checking via GBIF/GRIIS |
To use --check-iucn, you need a free API token:
starting_files/iucn_api_token.txtThe pipeline will automatically read the token from this file.
IUCN Red List: Official conservation status from the International Union for Conservation of Nature.
Invasive species: Checked against:
Lookup species: Conservation status is checked
against species_final_redund — the validated final species
name with trailing numbers removed.
cf. notation: For species like “Platichthys sp. cf. flesus”, the pipeline extracts the implied species by removing the “sp. cf.” portion:
Original DADA2 assignment: The original species assignment is ALWAYS checked, regardless of the validation decision:
iucn_category, invasive_statusiucn_category_original,
invasive_status_originalThis is important because:
invasive_status_originalThe species_concern_flag reflects whether EITHER the
final OR original species is threatened/invasive — so you never miss
important conservation information.
The sequence identity thresholds should be adjusted based on your amplicon length. A single nucleotide difference has different percentage impacts:
| Amplicon | ~Length | 1 nucleotide = | Suggested min_species_pct | Suggested reassign_diff_pct |
|---|---|---|---|---|
| MiFish 12S | 170bp | ~0.6% | 97-98% | 0.5-0.6% |
| 16S V4 | 300bp | ~0.3% | 97% | 0.3-0.4% |
| COI Leray | 313bp | ~0.3% | 97% | 0.3% |
| COI Folmer | 658bp | ~0.15% | 97-98% | 0.15-0.2% |
Download the repo locally and place your starting files (either your phyloseq object saved as a .rds file, or your starting ASV/OTU .csv taxonomy and .fasta sequence files) in the “starting_files” directory inside the repo.
Open your linux prompt (e.g. terminal in MacOS) and navigate to where your VAL_OTU_ID repo is.
Use bboxfinder.com to draw your study area and get coordinates.
Example for Trondheimsfjorden on the mid-Norwegian coast:
minLon: 6
minLat: 63
maxLon: 12
maxLat: 66
Example bounding box selection using bboxfinder.com
Here is an example command with some custom minimum percentages for acceptable species, genus and family level ID. Also, distance-thresholds for plausible IDs and congener search thresholds. There are updates as to the status of the script in the terminal until the script is complete.
python VAL_OTU_ID.py \
--input ps_chordata_CLEANED_TAXONOMY \
--midori MIDORI2_UNIQ_NUC_GB268_srRNA_DADA2.fasta \
--bbox 5 62 13 65 \
--min-species-pct 98 \
--min-genus-pct 92 \
--min-family-pct 88 \
--buffer-km 300 \
--distance-threshold-km 800 \
--congener-search-km 800 \python VAL_OTU_ID.py \
--input ps_chordata_CLEANED_TAXONOMY \
--midori MIDORI2_UNIQ_NUC_GB268_srRNA_DADA2.fasta \
--bbox 5 62 13 65 \
--min-species-pct 98 \
--min-genus-pct 92 \
--min-family-pct 88 \
--buffer-km 300 \
--distance-threshold-km 800 \
--congener-search-km 800 \
--check-iucn \
--check-invasiveNote: For
--check-iucn, place your IUCN API token instarting_files/iucn_api_token.txt
library(phyloseq)
# Load original phyloseq
ps <- readRDS("starting_files/ps_fish.rds")
# Read validated taxonomy
validated <- read.csv("results_files/ps_fish_phyloseq.csv")
rownames(validated) <- validated$ASV_ID
# Check decision distribution
table(validated$validation_decision)
# Filter to confident IDs only
confident <- validated[validated$validation_decision == "KEEP", ]
# Update taxonomy in phyloseq
# (keeping only rows that match)
keep_asvs <- rownames(validated)[validated$validation_decision == "KEEP"]
ps_validated <- prune_taxa(keep_asvs, ps)
# Or update all taxonomy with validated decisions
new_tax <- validated[, c("kingdom", "phylum", "class", "order",
"family", "genus", "species_final")]
rownames(new_tax) <- validated$ASV_ID
tax_table(ps) <- tax_table(as.matrix(new_tax))_phyloseq.csv columns:| Column | Description |
|---|---|
ASV_ID |
Unique identifier for each ASV |
kingdom through genus |
Higher taxonomy (unchanged from DADA2) |
species_final |
Validated species with unique number for each ASV/OTU (e.g., “Gadus morhua 3”) |
species_final_collapse |
Species without number (for aggregating replicates) |
species_final_redund |
Species name with trailing number stripped (for grouping) |
species_original |
Original DADA2 assignment |
habitat |
marine/freshwater/brackish/terrestrial (from WoRMS) |
validation_decision |
KEEP/REASSIGN/DROP_TO_GENUS/etc. |
decision_reason |
Human-readable explanation |
best_db_match |
Best matching species in MIDORI2 |
best_db_match_pct |
% identity to best match |
iucn_category |
IUCN Red List category for final species. For cf. species, queries the implied species. |
iucn_population_trend |
Population trend from IUCN |
invasive_status |
Invasive status for final species: INVASIVE/INTRODUCED/NOT_LISTED |
iucn_category_original |
IUCN category for original DADA2 assignment (all species-level assignments) |
invasive_status_original |
Invasive status for original DADA2 assignment (all species-level assignments) |
species_concern_flag |
Summary considering BOTH final and original: THREATENED, INVASIVE, THREATENED+INVASIVE, or NONE |
| Code | Meaning |
|---|---|
| CR | Critically Endangered |
| EN | Endangered |
| VU | Vulnerable |
| NT | Near Threatened |
| LC | Least Concern |
| DD | Data Deficient |
| NE | Not Evaluated (species not in IUCN database) |
| NA | Not Available (query error) |
| NOT_CHECKED | --check-iucn not specified |
| Status | Meaning |
|---|---|
| INVASIVE | Listed in GRIIS or GISD as invasive |
| INTRODUCED | Listed as introduced/alien species |
| NOT_LISTED | Not found in invasive species databases |
| NA | Query error |
| NOT_CHECKED | --check-invasive not specified |
| Reason | Interpretation |
|---|---|
| “Geographic CONFIRMED + sequence match (99.5%)” | High confidence - species confirmed |
| “Geographic CONFIRMED but sequence only 95.3% (below 98.0% threshold)” | Good location but poor sequence match - dropped to genus |
| “Local congener P. pollachius (100.0%) matches better than assigned (98.8%)” | Reassigned to better-matching local species |
| “Better DB match (C. pallasii 99.4%) not local; keeping 98.8% match” | Better match exists but isn’t local - kept original |
| “Only 1 local congener (P. flesus); sequence 97.5% - cf. notation” | Probable species but uncertain - cf. applied |
| “Geographic DOUBTFUL with 3 local congener(s)” | Wrong species likely - dropped to genus |
| “Multiple local congeners, matches too similar (98.8% vs 99.1%)” | Too close to call - dropped to genus |
Ensure R packages are installed:
The pipeline falls back to k-mer similarity, which is less accurate. Install vsearch:
Check if the better-matching congener is geographically local. The
pipeline now verifies that any reassignment target is CONFIRMED or
PLAUSIBLE in your study area. Look at the decision_reason
column - it will say “not local” if this was the reason.
best_db_match_pct column - is the difference
meaningful?--reassign-diff-pct if too many/few
reassignments## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Oslo
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DiagrammeR_1.0.11
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.38 RColorBrewer_1.1-3 R6_2.6.1 fastmap_1.2.0
## [5] xfun_0.54 magrittr_2.0.4 glue_1.8.0 cachem_1.1.0
## [9] knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.30 lifecycle_1.0.4
## [13] cli_3.6.5 visNetwork_2.1.4 sass_0.4.10 jquerylib_0.1.4
## [17] compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.5
## [21] bslib_0.9.0 yaml_2.3.10 htmlwidgets_1.6.4 rlang_1.1.6
## [25] jsonlite_2.0.0